
Cythonize away some perf hot spots #709


Open
wants to merge 17 commits into main

Conversation

leofang (Member) commented Jun 13, 2025

Description

Less aggressive version of #677.

Based on the summary in #658 (comment), this PR optimizes the identified hotspots to bring us much closer to our reference (CuPy). The optimization strategy is to

  • implement everything in pure Python (we do this today for cuda.core)
  • once hotspots are identified, we lower them to Cython
  • most importantly, we still call cuda.bindings Python APIs in the Cython code, so as to avoid introducing CTK as a build-time dependency (and therefore having to ship two separate packages cuda-core-cu11 and cuda-core-cu12)

In other words, this PR tries to strike a reasonable balance between performance, ease of development, and ease of deployment, without introducing any breaking changes (a rough sketch of the approach follows below).
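For illustration, a minimal sketch of this strategy. The class layout and names here are hypothetical and simplified, not taken from the actual PR; only driver.cuEventCreate and CUevent_flags follow the existing cuda.bindings driver API.

    # Illustrative sketch: a Cython cdef class whose hot path still calls the
    # cuda.bindings *Python* API, so building this module needs no CTK headers.
    from cuda.bindings import driver

    cdef class Event:
        cdef object _handle      # CUevent object returned by cuda.bindings
        cdef int _device_id

        @classmethod
        def _from_device(cls, int device_id):
            # Skip the Python-level __init__ machinery on the hot path.
            cdef Event self = Event.__new__(Event)
            # Still a Python-level call into cuda.bindings; the driver API is
            # resolved at run time, not at build/link time.
            err, handle = driver.cuEventCreate(
                driver.CUevent_flags.CU_EVENT_DISABLE_TIMING)
            # (error checking elided in this sketch)
            self._handle = handle
            self._device_id = device_id
            return self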

Preliminary data (an assumed setup for these timings is sketched after the numbers):

  • cuda.core main branch
In [5]: %timeit e = dev.create_event()
4.65 μs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [5]: %timeit s = dev.create_stream()
7.7 μs ± 9.98 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
  • this PR
In [6]: %timeit e = dev.create_event()
1.11 μs ± 6.91 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [4]: %timeit s = dev.create_stream()
4.12 μs ± 14 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
  • cupy
In [8]: %timeit e = cp.cuda.Event(disable_timing=True)
749 ns ± 5.45 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [14]: %timeit s = cp.cuda.Stream(non_blocking=True)
3.8 μs ± 8.54 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@leofang leofang added this to the cuda.core parking lot milestone Jun 13, 2025
@leofang leofang self-assigned this Jun 13, 2025
@leofang leofang added the enhancement, P0 (High priority - Must do!), and cuda.core labels Jun 13, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jun 13, 2025
copy-pr-bot (bot) commented Jun 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang leofang linked an issue Jun 25, 2025 that may be closed by this pull request
leofang (Member, Author) commented Jun 25, 2025

/ok to test 227d9c1

leofang (Member, Author) commented Jun 25, 2025

/ok to test 48de1b3

@leofang leofang closed this Jun 25, 2025
oleksandr-pavlyk (Contributor) commented

/ok to test

copy-pr-bot (bot) commented Jun 30, 2025

/ok to test

@oleksandr-pavlyk, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

oleksandr-pavlyk (Contributor) commented

/ok to test 0d573fc

To fix test failures with CTK 11.8 and driver 535.247.01, only attempt to query _ctx_handle if _device_id is None.

Ensure that context handle is set in Stream.context property
oleksandr-pavlyk (Contributor) commented

/ok to test 30de720

@leofang leofang changed the title WIP: Cythonize away some perf hot spots Cythonize away some perf hot spots Jun 30, 2025
leofang (Member, Author) commented Jun 30, 2025

PR description updated. Thanks to @oleksandr-pavlyk, the CI failure was identified and fixed (the refactoring introduced a call to cuStreamGetCtx, which cannot be called during stream capture with the 12.2 driver). This is ready for review.
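Roughly, the fix amounts to a guard along these lines; this is a sketch based on the commit messages above, not the actual diff.

    # Hypothetical helper on Stream illustrating the fix: cuStreamGetCtx cannot
    # be issued while a stream capture is in progress (observed with the 12.2
    # driver), so only query it when the device id is not already known.
    def _get_device_id(self):
        if self._device_id is None:
            err, self._ctx_handle = driver.cuStreamGetCtx(self._handle)
            # ... derive self._device_id from the returned context (elided)
        return self._device_id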

@leofang leofang marked this pull request as ready for review June 30, 2025 19:12
oleksandr-pavlyk (Contributor) commented

/ok to test e95c4b1

kkraus14 previously approved these changes Jul 1, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to In Review in CCCL Jul 1, 2025
emcastillo left a comment

LGTM!

oleksandr-pavlyk (Contributor) commented

/ok to test f4531e5

self._mnff = Event._MembersNeededForFinalize(self, None)

options = check_or_create_options(EventOptions, options, "Event options")
def _init(cls, device_id: int, ctx_handle: Context, options=None):
Contributor commented:

Since Event contains native class members, perhaps adding __cinit__ to initialize them is appropriate. Something like

    def __cinit__(self):
        self._timing_disabled = False
        self._busy_waited = False
        self._device_id = -1

I also think it would be safe to set object class members to None.

This would ensure that Event.__new__(Event) would return an initialized struct.

leofang (Member, Author) commented:

I think Cython sets everything to None for us, but it'd be good to verify that this is indeed the case:

Cython additionally takes responsibility of setting all object attributes to None,

https://cython.readthedocs.io/en/latest/src/userguide/special_methods.html#initialisation-methods-cinit-and-init

Contributor commented:

Ok, let's leave object members out. Should I push adding Event.__cinit__?

leofang (Member, Author) commented:

I think the same section says all members are zero/null initialized?

Contributor commented:

Yes, but is it appropriate to zero initialize _device_id? Perhaps it does not matter much.
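For reference, a short sketch of the behavior being discussed, reusing the attribute names from the snippet above (the class body is illustrative, not the actual cuda.core implementation): Cython zero/None-initializes cdef class attributes before __cinit__ runs, so an explicit __cinit__ is mainly useful for non-zero sentinels such as _device_id = -1.

    cdef class Event:
        cdef:
            bint _timing_disabled   # zero-initialized (False) by Cython
            bint _busy_waited       # likewise False
            int _device_id          # zero-initialized to 0, hence the sentinel below
            object _mnff            # object attributes are set to None automatically

        def __cinit__(self):
            # Runs even for Event.__new__(Event); only the non-zero sentinel
            # needs to be set by hand.
            self._device_id = -1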

leofang (Member, Author) commented Jul 1, 2025

CI is green

oleksandr-pavlyk (Contributor) left a comment

LGTM!

Successfully merging this pull request may close these issues.

[FEA]: Faster initialization time for cuda.core abstractions